Have you wondered how trends among the businesses around you relate to their degree of success? Manufacturing is an important sector, and crucial aspects of daily life depend on it. The Kinder Institute for Urban Research at Rice University in Houston publishes an interesting data set about the manufacturing businesses in Houston. The data set provides insight into the different business types and their corresponding revenue categories for 2018-2021. The analysis uses the R language and its dplyr and ggplot2 packages, which are widely used for data manipulation and visualization, for trend analysis. The shiny and plotly packages may also be used for interactive visualization. These tools will illustrate trends of manufacturing businesses in Houston, emphasizing the successful businesses in the area. To sum up, analyzing this data and highlighting these trends can help many people, among them academics who might use the trends in their studies and research.
The following packages are loaded into the R session for analyzing the data.
library(rmarkdown, quietly = TRUE) # this package is for rendering dynamic R Markdown (.Rmd) files
library(tidyverse, quietly = TRUE) # this is the main collection of packages for data manipulation and tidying
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr, quietly = TRUE) # this package is for tidying the data
library(tibble, quietly = TRUE) # this package is for creating tibbles and re-imaging the data frame
library(dplyr, quietly = TRUE) # this package is for data manipulation
library(ggplot2, quietly = TRUE) # this package is for data visualization
library(plotly, quietly = TRUE) # this package is for interactive plots
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(scales, quietly = TRUE) # this package is for scaling data
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(colorspace, quietly = TRUE)
# this package provides color palettes for a large number of categories in visualizations
library(readr, quietly = TRUE) # this package is for reading data from
# delimited file, such as .csv file
library(stringr, quietly = TRUE) # this package is for text data, for the `str_remove()` function in the code
The following data set will be used in this project.
# import `vi_manuf2018plus` data set
vi_manuf2018plus <- read_csv("vi_manuf2018plus.csv")
## Rows: 17160 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): naics_code, naics_code_descr, revenue_category, employee_size
## dbl (2): yr, number_of_biz
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(vi_manuf2018plus)
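The import message above suggests specifying the column types. A minimal sketch of that follows, assuming the six columns reported in the column specification; the inline CSV text and the name `manuf` are illustrative stand-ins, not part of the original data or code.

```r
library(readr)

# Toy CSV text standing in for vi_manuf2018plus.csv (values are illustrative only)
csv_text <- I("yr,naics_code,naics_code_descr,revenue_category,employee_size,number_of_biz
2018,311,Food Manufacturing,Less than $100K,1-4,70
2018,311,Food Manufacturing,$100K-$200K,1-4,235
")

# Declaring the types up front silences the column-specification message
# and keeps NAICS codes as text rather than numbers
manuf <- read_csv(
  csv_text,
  col_types = cols(
    yr = col_integer(),
    naics_code = col_character(),
    naics_code_descr = col_character(),
    revenue_category = col_character(),
    employee_size = col_character(),
    number_of_biz = col_integer()
  )
)

manuf
```

With the real file, the same `col_types` argument could be passed alongside the file path in place of `csv_text`.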
Source: Data Axle & Kinder Institute For Urban Research-Urban Data Platform Team. (2021). Counts of Manufacturing Businesses - Houston MSA - 2018-2021 (Version 1) [dataset]. Rice University. https://doi.org/10.25612/837.RDB506Y63571.
One peculiarity that the website mentions about the source of the data: it is uncertain whether a change in the number of firms inventoried from one year to the next is due to businesses emerging or vanishing in a specific location, or to changes in Data Axle’s data-gathering procedures. The source datasets have a YrEstab (year established) variable, although it is null in 75% of the records. The YrAppYp variable (the year the business first appeared in the Yellow Pages directory) is never null, but it cannot be used to identify when a business was started; in this subset, it equals the year established value (when known) in only around 60% of cases.
More information about the source of the data, according to kinderup.org: the dataset is available for download on the website and is organized into three tables. Most users will just require the vi_manuf2018plus table. Users who wish to deal solely with coded values and a codebook should use the manuf2018plus and manuf2018plus_code tables. Please be aware that rows displaying subtotals for combinations of the naics_code, employee_size, and revenue_category variables are included. The 2018 dataset has a date stamp of 2/12/2019. The datasets for 2019, 2020, and 2021 include date stamps from May or June of their respective years.
# glimpse function display info about the data set
glimpse(vi_manuf2018plus)
## Rows: 17,160
## Columns: 6
## $ yr <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ naics_code <chr> "311", "311", "311", "311", "311", "311", "311", "311…
## $ naics_code_descr <chr> "Food Manufacturing", "Food Manufacturing", "Food Man…
## $ revenue_category <chr> NA, "Less than $100K", "$100K-$200K", "$200K-$500K", …
## $ employee_size <chr> "1-4", "1-4", "1-4", "1-4", "1-4", "1-4", "1-4", "1-4…
## $ number_of_biz <dbl> 13, 70, 235, 2087, 166, 75, 28, 3, 0, 0, 0, 0, 0, 0, …
Now, data cleaning. The data set will be checked for missing values and for rows that do not belong in the analysis. Before that, the data types of the columns are checked.
# view data types of columns
data_types <- sapply(vi_manuf2018plus, class)
# print
data_types
## yr naics_code naics_code_descr revenue_category
## "numeric" "character" "character" "character"
## employee_size number_of_biz
## "character" "numeric"
Now, check for missing values. This code returns the number of NA values in each column of the data set.
# check for missing values
missing_val <- vi_manuf2018plus %>%
summarize_all(function(x) sum(is.na(x)))
# print
missing_val
## # A tibble: 1 × 6
## yr naics_code naics_code_descr revenue_category employee_size number_of_biz
## <int> <int> <int> <int> <int> <int>
## 1 0 0 0 1144 1320 0
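The same per-column NA count can be written with `across()`, which dplyr documents as the replacement for the superseded `summarize_all()`. The sketch below runs on a small made-up tibble (`toy`), not on the real data.

```r
library(dplyr)

# Toy data with a few missing values (illustrative only)
toy <- tibble(
  yr = c(2018, 2018, 2019),
  revenue_category = c(NA, "Less than $100K", "$100K-$200K"),
  employee_size = c("1-4", NA, NA)
)

# Count NA values in every column
missing_val <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_val
```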
Rows with NA values would create inaccuracies and must be deleted.
# deleting rows with missing values
vi_manuf2018plus <- na.omit(vi_manuf2018plus)
Upon eyeballing the data set, some cells hold values that are not real categories, such as "Total" subtotals. This code detects those values and counts how often they occur in the data set.
# filter for outliers
outliers <- vi_manuf2018plus %>%
apply(2, function(x) !any(str_detect(x, regex("Total", ignore_case = TRUE))))
outliers <- vi_manuf2018plus %>%
apply(2, function(x) !any(str_detect(x, regex("All businesses with NAICS codes starting with 311-316, 321-327, 331-337, 339", ignore_case = TRUE))))
# find the occurrences of outliers
outliers_total_count <- sum(apply(vi_manuf2018plus, 2, function(x) grepl("Total", x, ignore.case = TRUE)))
# find the occurrences of outliers
outliers_total_all <- sum(apply(vi_manuf2018plus, 2, function(x) grepl("All businesses with NAICS codes starting with 311-316, 321-327, 331-337, 339", x, ignore.case = TRUE)))
# print
outliers_total_count
## [1] 2288
outliers_total_all
## [1] 672
Because these subtotal values would distort the interpretation of the data set, the rows that contain them are dropped as well.
# delete the outliers values from the data set
vi_manuf2018plus <- vi_manuf2018plus %>%
rowwise() %>%
filter(!any(str_detect(across(everything()), regex("Total", ignore_case = TRUE))))
vi_manuf2018plus <- vi_manuf2018plus %>%
rowwise() %>%
filter(!any(str_detect(across(everything()), regex("All businesses with NAICS codes starting with 311-316, 321-327, 331-337, 339", ignore_case = TRUE))))
# recount the "Total" matches after deletion; note that `outliers_total_all` stores the re-filtered table (not a count) for later inspection
outliers_total_count <- sum(apply(vi_manuf2018plus, 2, function(x) any(str_detect(x, regex("Total", ignore_case = TRUE)))))
outliers_total_all <- vi_manuf2018plus %>%
rowwise() %>%
filter(!any(str_detect(across(everything()), regex("All businesses with NAICS codes starting with 311-316, 321-327, 331-337, 339", ignore_case = TRUE))))
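The row-wise filters above can also be expressed without `rowwise()`, using `if_any()` over the character columns. This is only a sketch on a made-up tibble; the combined pattern and the names `toy`, `subtotal_pattern`, `cleaned`, and `n_left` are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Toy data: one real row and two subtotal-style rows (illustrative only)
toy <- tibble(
  naics_code = c("311", "311", "All businesses with NAICS codes starting with 311-316"),
  revenue_category = c("Less than $100K", "Total", "Less than $100K"),
  number_of_biz = c(70, 305, 900)
)

# One case-insensitive pattern covering both kinds of subtotal label
subtotal_pattern <- regex("Total|All businesses with NAICS codes", ignore_case = TRUE)

# Keep only rows where no character column matches the pattern
cleaned <- toy %>%
  filter(!if_any(where(is.character), ~ str_detect(.x, subtotal_pattern)))

# Number of subtotal rows still present after cleaning
n_left <- cleaned %>%
  filter(if_any(where(is.character), ~ str_detect(.x, subtotal_pattern))) %>%
  nrow()

cleaned
n_left
```

Because `n_left` is a plain number, it can be compared with zero directly when verifying the cleanup.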
To confirm that the data is clean, there should be no NA values and no remaining "Total" matches, and re-applying the subtotal filter should leave the table unchanged at 12,012 rows.
# check and debug after data clean up
# check for missing values
# calculate the total number of NA values in the data set
total_na <- sum(is.na(vi_manuf2018plus))
# print the total number of NA values
total_na
## [1] 0
# check outliers number of occurrences
outliers_total_count
## [1] 0
outliers_total_all
## # A tibble: 12,012 × 6
## # Rowwise:
## yr naics_code naics_code_descr revenue_category employee_size
## <dbl> <chr> <chr> <chr> <chr>
## 1 2018 311 Food Manufacturing Less than $100K 1-4
## 2 2018 311 Food Manufacturing $100K-$200K 1-4
## 3 2018 311 Food Manufacturing $200K-$500K 1-4
## 4 2018 311 Food Manufacturing $500K-$1M 1-4
## 5 2018 311 Food Manufacturing $1M-$2.5M 1-4
## 6 2018 311 Food Manufacturing $2.5M-$5M 1-4
## 7 2018 311 Food Manufacturing $5M-$10M 1-4
## 8 2018 311 Food Manufacturing $10M-$20M 1-4
## 9 2018 311 Food Manufacturing $20M-$50M 1-4
## 10 2018 311 Food Manufacturing $50M-$100M 1-4
## # ℹ 12,002 more rows
## # ℹ 1 more variable: number_of_biz <dbl>
# check data types for each column again to check
data_types
## yr naics_code naics_code_descr revenue_category
## "numeric" "character" "character" "character"
## employee_size number_of_biz
## "character" "numeric"
All the variables will be used in the analysis. Here is a
tibble display of the clean data.
# illustrate data set as tibble
tibble <- as_tibble(vi_manuf2018plus)
# print
tibble
## # A tibble: 12,012 × 6
## yr naics_code naics_code_descr revenue_category employee_size
## <dbl> <chr> <chr> <chr> <chr>
## 1 2018 311 Food Manufacturing Less than $100K 1-4
## 2 2018 311 Food Manufacturing $100K-$200K 1-4
## 3 2018 311 Food Manufacturing $200K-$500K 1-4
## 4 2018 311 Food Manufacturing $500K-$1M 1-4
## 5 2018 311 Food Manufacturing $1M-$2.5M 1-4
## 6 2018 311 Food Manufacturing $2.5M-$5M 1-4
## 7 2018 311 Food Manufacturing $5M-$10M 1-4
## 8 2018 311 Food Manufacturing $10M-$20M 1-4
## 9 2018 311 Food Manufacturing $20M-$50M 1-4
## 10 2018 311 Food Manufacturing $50M-$100M 1-4
## # ℹ 12,002 more rows
## # ℹ 1 more variable: number_of_biz <dbl>
Fundamentally, yr is the year that the data was collected.
naics_code and naics_code_descr are the NAICS (North American
Industry Classification System) code and its description, and
revenue_category groups the revenues of those businesses into ranges.
employee_size is a range of the number of employees in that
company, while number_of_biz is the count of businesses that
match the criteria in the row.
This data set has naics_code_descr,
revenue_category, and employee_size all with
different categories. All distinct categories in these 3 columns will be
shown as a guide to interpret the data set.
# find distinct categories in each column
# for naics code
descr_categories <- vi_manuf2018plus %>%
select(naics_code) %>%
distinct()
# for revenue category
revenue_categories <- vi_manuf2018plus %>%
select(revenue_category) %>%
distinct()
# for employee size
empl_size_categories <- vi_manuf2018plus %>%
select(employee_size) %>%
distinct()
# combine all variables of distinct occurrences into a list
distinct <- list(
code_categories = unique(vi_manuf2018plus$naics_code),
descr_categories = unique(vi_manuf2018plus$naics_code_descr),
revenue_categories = unique(vi_manuf2018plus$revenue_category),
empl_size_categories = unique(vi_manuf2018plus$employee_size)
)
# print
distinct
## $code_categories
## [1] "311" "312" "313" "314" "315" "316" "321" "322" "323" "324" "325" "326"
## [13] "327" "331" "332" "333" "334" "335" "336" "337" "339"
##
## $descr_categories
## [1] "Food Manufacturing"
## [2] "Beverage and Tobacco Product Manufacturing"
## [3] "Textile Mills"
## [4] "Textile Product Mills"
## [5] "Apparel Manufacturing"
## [6] "Leather and Allied Product Manufacturing"
## [7] "Wood Product Manufacturing"
## [8] "Paper Manufacturing"
## [9] "Printing and Related Support Activities"
## [10] "Petroleum and Coal Products Manufacturing"
## [11] "Chemical Manufacturing"
## [12] "Plastics and Rubber Products Manufacturing"
## [13] "Nonmetallic Mineral Product Manufacturing"
## [14] "Primary Metal Manufacturing"
## [15] "Fabricated Metal Product Manufacturing"
## [16] "Machinery Manufacturing"
## [17] "Computer and Electronic Product Manufacturing"
## [18] "Electrical Equipment, Appliance, and Component Manufacturing"
## [19] "Transportation Equipment Manufacturing"
## [20] "Furniture and Related Product Manufacturing"
## [21] "Miscellaneous Manufacturing (e.g. medical equipment and supplies, jewelry, silverware, sporting goods, toys, etc.)"
##
## $revenue_categories
## [1] "Less than $100K" "$100K-$200K" "$200K-$500K" "$500K-$1M"
## [5] "$1M-$2.5M" "$2.5M-$5M" "$5M-$10M" "$10M-$20M"
## [9] "$20M-$50M" "$50M-$100M" "$100M-$500M" "$500M-$1B"
## [13] "Over $1B"
##
## $empl_size_categories
## [1] "1-4" "5-9" "10-19" "20-49" "50-99"
## [6] "100-249" "250-499" "500-999" "1,000-4,999" "5,000-9,999"
## [11] "Over 10,000"
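The listing above shows 21 industry descriptions, 13 revenue categories, and 11 employee-size categories. A short sketch of counting distinct values per column follows, using a made-up tibble (`toy`); it also checks that 4 years times 21 times 13 times 11 equals the 12,012 rows of the cleaned table.

```r
library(dplyr)

# Toy data (illustrative only)
toy <- tibble(
  naics_code_descr = c("Food Manufacturing", "Food Manufacturing", "Textile Mills"),
  revenue_category = c("Less than $100K", "$100K-$200K", "Less than $100K"),
  employee_size = c("1-4", "1-4", "5-9")
)

# Number of distinct values in each column
category_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Expected size of a complete grid: years x industries x revenue x employee size
expected_rows <- 4 * 21 * 13 * 11

category_counts
expected_rows
```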
To take a closer look at the success of these businesses, their yearly growth can be analyzed based on their counts.
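# calculating the growth rate; note that lag() runs over the ungrouped summary table, so each row is compared with the row before it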
yearly_growth <- vi_manuf2018plus %>%
group_by(yr, naics_code_descr) %>%
summarise(number_of_biz = sum(number_of_biz), .groups = "drop") %>%
mutate(growth_rate = paste0(round(((number_of_biz - lag(number_of_biz)) / lag(number_of_biz)) * 100, 4), "%"))
# create a new column with abbreviated labels for easy visualization
yearly_growth$short_descr <- abbreviate(yearly_growth$naics_code_descr)
# print
yearly_growth
## # A tibble: 84 × 5
## yr naics_code_descr number_of_biz growth_rate short_descr
## <dbl> <chr> <dbl> <chr> <chr>
## 1 2018 Apparel Manufacturing 126 NA% AppM
## 2 2018 Beverage and Tobacco Product Man… 504 300% BaTPM
## 3 2018 Chemical Manufacturing 965 91.4683% ChmM
## 4 2018 Computer and Electronic Product … 913 -5.3886% CaEPM
## 5 2018 Electrical Equipment, Appliance,… 450 -50.7119% EEAaCM
## 6 2018 Fabricated Metal Product Manufac… 5798 1188.4444% FMPM
## 7 2018 Food Manufacturing 3589 -38.0993% FdMn
## 8 2018 Furniture and Related Product Ma… 1639 -54.3327% FaRPM
## 9 2018 Leather and Allied Product Manuf… 47 -97.1324% LaAPM
## 10 2018 Machinery Manufacturing 1848 3831.9149% MchM
## # ℹ 74 more rows
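In the table above, every displayed row is from 2018, and the 300% shown for Beverage and Tobacco equals (504 - 126) / 126, that is, a comparison with the Apparel row rather than with the previous year. A minimal sketch of a per-category year-over-year rate, assuming that is the intended quantity, is given below on made-up numbers; `toy` and `growth_by_category` are illustrative names.

```r
library(dplyr)

# Toy counts for two categories over two years (illustrative only)
toy <- tibble(
  yr = c(2018, 2018, 2019, 2019),
  naics_code_descr = c("Apparel Manufacturing", "Food Manufacturing",
                       "Apparel Manufacturing", "Food Manufacturing"),
  number_of_biz = c(100, 200, 110, 150)
)

growth_by_category <- toy %>%
  group_by(naics_code_descr, yr) %>%
  summarise(number_of_biz = sum(number_of_biz), .groups = "drop") %>%
  arrange(naics_code_descr, yr) %>%
  # lag() now looks back one year within the same category
  group_by(naics_code_descr) %>%
  mutate(growth_rate = (number_of_biz - lag(number_of_biz)) / lag(number_of_biz) * 100) %>%
  ungroup()

growth_by_category
```

The first year of each category has no predecessor, so its growth rate is NA by construction.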
Here is a plotly bar plot that visualizes the yearly
growth of all the businesses.
Note: To ensure accuracy, rows with NA values are excluded
from yearly_growth. Necessary conversions are applied to the
columns so that the values can be used in calculations and visualization. To keep
the visuals readable, the naics_code_descr column categories are abbreviated.
# exclude rows with NA value
yearly_growth <- yearly_growth %>% filter(!is.na(growth_rate))
yearly_growth <- yearly_growth %>% filter(!is.na(number_of_biz))
# convert `growth_rate` column to numeric
yearly_growth$growth_rate <- as.numeric(str_remove(yearly_growth$growth_rate, "%"))
## Warning: NAs introduced by coercion
# convert `naics_code_descr` column to factor
yearly_growth$naics_code_descr <- as.factor(yearly_growth$naics_code_descr)
# sort the data frame by yr
yearly_growth <- yearly_growth %>% arrange(yr)
# Plotly bar plot
plot_ly(yearly_growth,
x = ~reorder(short_descr, -growth_rate),
y = ~growth_rate,
type = "bar",
marker = list(color = 'red')) %>%
layout(title = "Growth Rate by Category",
xaxis = list(title = "Category", tickangle = -90),
yaxis = list(title = "Growth Rate (%)")) # values are already in percent; a ".0%" tick format would multiply them by 100
## Warning: Ignoring 1 observations
Machinery manufacturing recorded the highest growth rate, while apparel manufacturing recorded the lowest.
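The "NAs introduced by coercion" warning in the preparation step comes from the text value "NA%": it is a string, so `is.na()` does not catch it, and it only becomes NA when converted back to a number. One way to avoid the round trip, sketched here on made-up values, is to keep the rate numeric and build a percent label only for display with `scales::label_percent()`.

```r
library(scales)

# Growth rates kept as numbers, already expressed in percent (illustrative values)
growth_rate <- c(NA, 300, 91.4683, -5.3886)

# Missing values are detectable directly because the column stays numeric
has_rate <- !is.na(growth_rate)

# Build display labels separately; scale = 1 because values are already percentages
growth_label <- ifelse(
  has_rate,
  label_percent(scale = 1, accuracy = 0.01)(growth_rate),
  NA_character_
)

growth_label
```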
Here is the same data as a plotly scatter plot.
# plotly scatter plot
plot_ly(yearly_growth, x = ~short_descr, y = ~growth_rate, type = 'scatter', mode = 'markers', marker = list(color = 'red')) %>%
layout(title = "Yearly Growth Rate By Category",
xaxis = list(title = "Category"),
yaxis = list(title = "Growth Rate (%)"))
## Warning: Ignoring 1 observations
Here is a plotly box plot of the businesses' yearly growth by category.
# plotly box plot
plot_ly(yearly_growth,
x = ~as.factor(short_descr),
y = ~growth_rate,
type = "box",
color = ~as.factor(yr),
colors = "red") %>%
layout(title = "Yearly Growth Rate By Category",
xaxis = list(title = "Category"),
yaxis = list(title = "Growth Rate (%)"))
## Warning: Ignoring 1 observations
Another way to look at the success of these businesses would be through an analysis of the number of businesses in each category.
# calculate the number of businesses
num_of_biz_analysis <- vi_manuf2018plus %>%
group_by(naics_code_descr) %>%
summarise(total_number_of_biz = sum(number_of_biz)) %>%
arrange(desc(total_number_of_biz))
# create a new column with abbreviated labels for easy visualization
num_of_biz_analysis$short_descr <- abbreviate(num_of_biz_analysis$naics_code_descr)
# print
num_of_biz_analysis
## # A tibble: 21 × 3
## naics_code_descr total_number_of_biz short_descr
## <chr> <dbl> <chr>
## 1 Miscellaneous Manufacturing (e.g. medical eq… 30858 MM(measjss…
## 2 Fabricated Metal Product Manufacturing 23985 FMPM
## 3 Food Manufacturing 15969 FdMn
## 4 Printing and Related Support Activities 11952 PaRSA
## 5 Machinery Manufacturing 7442 MchM
## 6 Furniture and Related Product Manufacturing 6994 FaRPM
## 7 Petroleum and Coal Products Manufacturing 6025 PaCPM
## 8 Chemical Manufacturing 4134 ChmM
## 9 Computer and Electronic Product Manufacturing 3806 CaEPM
## 10 Textile Product Mills 3120 TxPM
## # ℹ 11 more rows
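These totals add up the counts from all four years, so a business present in every year is counted about four times (Food Manufacturing, for instance, shows 3,589 in 2018 against 15,969 overall). A hedged alternative, sketched on made-up numbers, is to total each year first and then average across years; `per_year` and `avg_per_category` are illustrative names.

```r
library(dplyr)

# Toy counts for two categories over two years (illustrative only)
toy <- tibble(
  yr = c(2018, 2019, 2018, 2019),
  naics_code_descr = c("Food Manufacturing", "Food Manufacturing",
                       "Textile Mills", "Textile Mills"),
  number_of_biz = c(200, 220, 10, 14)
)

# Step 1: total businesses per category within each year
per_year <- toy %>%
  group_by(naics_code_descr, yr) %>%
  summarise(biz_in_year = sum(number_of_biz), .groups = "drop")

# Step 2: average those yearly totals per category
avg_per_category <- per_year %>%
  group_by(naics_code_descr) %>%
  summarise(avg_biz_per_year = mean(biz_in_year)) %>%
  arrange(desc(avg_biz_per_year))

avg_per_category
```

The ranking of categories is usually the same either way; the averaged figure is simply easier to read as "businesses in a typical year".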
Here is the total number of businesses in each business category.
# plotly bar plot
plot_ly(num_of_biz_analysis,
x = ~reorder(short_descr, -total_number_of_biz),
y = ~total_number_of_biz,
type = 'bar',
color = ~short_descr,
colors = 'Dark2') %>%
layout(title = "Number of Business Analysis for Each Category",
xaxis = list(title = "NAICS Code Description", tickangle = -90),
yaxis = list(title = "Number of Businesses"))
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
Miscellaneous Manufacturing is by far the most common category, while Leather and Allied Product Manufacturing is rare.
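The RColorBrewer warning for the bar plot arises because the Dark2 palette provides at most 8 colors while the plot has 21 categories. A sketch of generating 21 colors with the colorspace package, which is already loaded in this report, is shown below; the choice of the "Dark 3" palette is an assumption made for illustration.

```r
library(colorspace)

# One color per business category
n_categories <- 21

# qualitative_hcl() returns a character vector of hex color codes
category_colors <- qualitative_hcl(n_categories, palette = "Dark 3")

category_colors
```

The resulting vector could be supplied as `colors = category_colors` in the `plot_ly()` call in place of `'Dark2'`.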
Here is a scatter plot that demonstrates how dominant / rare the businesses are.
# plotly scatter plot
plot_ly(num_of_biz_analysis,
x = ~total_number_of_biz,
y = ~reorder(short_descr, -total_number_of_biz),
type = 'scatter',
mode = 'markers',
marker = list(color = 'red')) %>%
layout(title = "Total Number of Businesses by Category",
xaxis = list(title = "Total Number of Businesses"),
yaxis = list(title = "Business Category"))
Here is the same illustration but in a bar plot.
# plotly bar plot
plot_ly(num_of_biz_analysis, x = ~total_number_of_biz, y = ~short_descr, type = 'bar', orientation = 'h', marker = list(color = 'red')) %>%
layout(title = "Total Number of Businesses by Category",
xaxis = list(title = "Total Number of Businesses"),
yaxis = list(title = "Business Category"))
Money! Whenever there is money, there is success. So, the success of these business categories can also be shown through a revenue-per-business analysis.
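# assign a unique numeric ID to each revenue category
# purpose: to enable calculation of values
# note: factor() orders its levels alphabetically by default, so these IDs do not follow revenue size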
vi_manuf2018plus <- vi_manuf2018plus %>%
mutate(revenue_category_id = as.numeric(factor(revenue_category)))
# calculate the revenue per business for each category
vi_manuf2018plus <- vi_manuf2018plus %>%
mutate(revenue_per_biz = ifelse(number_of_biz != 0, revenue_category_id / number_of_biz, 0))
# calculate the average revenue per business for each category
revenue_per_biz <- vi_manuf2018plus %>%
group_by(naics_code_descr) %>%
summarise(avg_revenue_per_biz = mean(revenue_per_biz, na.rm = TRUE)) %>%
arrange(desc(avg_revenue_per_biz))
# create a new column with abbreviated labels for easy visualization
revenue_per_biz$short_descr <- abbreviate(revenue_per_biz$naics_code_descr)
# print
revenue_per_biz
## # A tibble: 21 × 3
## naics_code_descr avg_revenue_per_biz short_descr
## <chr> <dbl> <chr>
## 1 Food Manufacturing 0.101 FdMn
## 2 Primary Metal Manufacturing 0.0843 PrMM
## 3 Machinery Manufacturing 0.0728 MchM
## 4 Textile Mills 0.0722 TxtM
## 5 Nonmetallic Mineral Product Manufacturing 0.0709 NMPM
## 6 Computer and Electronic Product Manufacturing 0.0701 CaEPM
## 7 Furniture and Related Product Manufacturing 0.0696 FaRPM
## 8 Transportation Equipment Manufacturing 0.0695 TrEM
## 9 Chemical Manufacturing 0.0693 ChmM
## 10 Plastics and Rubber Products Manufacturing 0.0662 PaRPM
## # ℹ 11 more rows
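The values above are category IDs divided by business counts, so they are an index rather than dollar amounts, and the IDs come from alphabetical factor levels. A different, hedged way to summarize revenue is sketched below: the 13 revenue categories are ranked in the order listed earlier, and each industry gets a mean rank weighted by its number of businesses. This is a swapped-in technique on made-up counts, not the method used in this report; `revenue_levels`, `revenue_rank`, and `avg_rank` are illustrative names.

```r
library(dplyr)

# Revenue categories in ascending order, as listed in the data set
revenue_levels <- c(
  "Less than $100K", "$100K-$200K", "$200K-$500K", "$500K-$1M",
  "$1M-$2.5M", "$2.5M-$5M", "$5M-$10M", "$10M-$20M",
  "$20M-$50M", "$50M-$100M", "$100M-$500M", "$500M-$1B", "Over $1B"
)

# Toy counts (illustrative only)
toy <- tibble(
  naics_code_descr = c("Food Manufacturing", "Food Manufacturing", "Textile Mills"),
  revenue_category = c("Less than $100K", "$1M-$2.5M", "$100K-$200K"),
  number_of_biz = c(10, 30, 20)
)

avg_rank <- toy %>%
  # Explicit levels make rank 1 the lowest revenue band and rank 13 the highest
  mutate(revenue_rank = as.integer(factor(revenue_category, levels = revenue_levels))) %>%
  group_by(naics_code_descr) %>%
  # Weight each band by how many businesses fall into it
  summarise(avg_revenue_rank = weighted.mean(revenue_rank, number_of_biz)) %>%
  arrange(desc(avg_revenue_rank))

avg_rank
```

A higher average rank means more of that industry's businesses sit in higher revenue bands; the ranks are ordinal, so differences between them are not dollar differences.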
# plotly bar plot
plot_ly(revenue_per_biz, x = ~reorder(short_descr, -avg_revenue_per_biz), y = ~avg_revenue_per_biz, type = 'bar', name = 'Average Revenue per Business', marker = list(color = 'red')) %>%
layout(title = "Average Revenue per Business for Each Category",
xaxis = list(title = "NAICS Code Description"),
yaxis = list(title = "Average Revenue per Business"))
It is clear that most of the categories have revenues close to each other, with Food Manufacturing having the highest average revenue and Apparel Manufacturing having the lowest average revenue.
Here is a scatter plot illustrating the business category averages of revenue.
# plotly plot
plot_ly(revenue_per_biz, x = ~short_descr, y = ~avg_revenue_per_biz, type = 'scatter', mode = 'markers', name = 'Average Revenue per Business', marker = list(color = 'red')) %>%
layout(title = "Average Revenue per Business for Each Category",
xaxis = list(title = "NAICS Code Description"),
yaxis = list(title = "Average Revenue per Business"))
In conclusion, this project explored and analyzed manufacturing businesses in Houston, as detailed in a data set from the Kinder Institute for Urban Research at Rice University. The data set provided insights into various business types and their corresponding revenue categories for the years 2018-2021.
To delve into this data, the R programming language was employed, using packages such as dplyr for data manipulation and plotly for interactive visualization. The analysis was centered around three key aspects: the yearly growth of businesses, the total number of businesses in each category, and the average revenue per business for each category.
The analysis unveiled several intriguing trends. For instance, ‘Miscellaneous Manufacturing’ emerged as the most common business category, while ‘Apparel Manufacturing’ was the least common. In terms of growth rate, ‘Machinery Manufacturing’ recorded the highest, indicating a thriving sector. When it came to revenue, most categories were found to have similar revenues, with ‘Food Manufacturing’ leading the pack and ‘Apparel Manufacturing’ at the bottom.
These findings offer valuable insights into the performance and trends of different business categories. They can guide stakeholders, including academics and researchers, in making informed decisions and conducting further studies. The trends highlighted in this analysis are particularly useful for understanding the dynamics of the manufacturing sector in Houston.
While the analysis provides useful insights, it does have some limitations. It assumes the accuracy and currency of the data, which may not always hold true. Also, it does not take into account other potentially influential factors such as geographical location, market conditions, and government policies. Future work could aim to incorporate these factors into the analysis and employ more sophisticated statistical techniques to model and predict trends.